class: center, middle, inverse, title-slide

# Lecture 3
## Summary Statistics
### Psych 10 C
### University of California, Irvine
### 03/30/2022

---

## Summary Statistics

- Another common way to summarize information from experimental or survey data is by using a statistic.

--

- Statistics are **functions** of **random variables** in an experiment that can be used to convey information about the location of our observations and how they vary.

--

- Not all the variables in an experiment are equally important, so we don't usually look for ways to visualize them, but we still want to make sure that we gather all the information we can about our sample.

--

- In that case we can use summary statistics to report some of their properties.

---

class: inverse, center, middle

# Random Variables and Functions

---

## Random variables

- Statisticians are very bad at naming things ...

--

- When you think of the words "random" and "variable", what comes to mind first?

--

- The formal definition is the opposite!

--

- **Definition:** A random variable is a function of the outcomes of an experiment.

---

# Functions

- Functions have formal definitions; however, what's really important is that we remember how they "work".

--

- Intuitively, we can think of functions as rules regarding how two groups of "things" are associated.

--

- For example, imagine we throw a coin and record whether it ends up being heads or tails. A function could be a rule that states:
  - `\(x = 0\)` if the outcome is tails and `\(x = 1\)` if the outcome is heads.

--

- This is a simple rule that lets us assign numbers to a variable `\(x\)` depending on the result of a coin toss. In other words, `\(x\)` is defined as a function of the outcome of the experiment.

---

## Functions

- Another simple function would be `\(y = x + 1\)`.
- This function tells us that whatever the value of `\(x\)` is, we can get the value of `\(y\)` by adding `\(1\)` to `\(x\)`.
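- As a quick sketch in R (the function names and toy values here are my own, not part of the experiment), both rules can be written as small functions:

```r
# Rule for the coin toss: assign 1 to heads and 0 to tails
coin_rule <- function(outcome) {
  ifelse(outcome == "heads", 1, 0)
}

# The simple function y = x + 1
add_one <- function(x) {
  x + 1
}

coin_rule(c("heads", "tails", "heads"))  # 1 0 1
add_one(3)                               # 4
```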
--

- Regardless of how complex they look, we can always think of functions as a "map" that specifies how to go from the values of one variable to the values of a second variable.

---

## Back to Random Variables

- Random variables are neither random nor variables; they are simply the rules we use to assign numbers to the outcomes of an experiment.

--

- In other words, random variables are deterministic functions (See? Statisticians are really bad at naming things!).

--

- In our previous example with the coin toss, `\(x\)` can be considered a random variable, as there's a rule that assigns a numeric value (0 or 1) to the outcome of the experiment (heads or tails).

---

## Example with the memory experiment

- Let's think back to our memory experiment. For any participant taking any test (say, participant 1 on test 1), each time we presented a word they could have indicated that the word was either on the original list or not.

--

- Simultaneously, each word we presented was either on the original list or it wasn't.

--

- Since we don't know how any person would respond to any word, we can treat the responses registered as being probabilistic.

--

- Now we can create a random variable that says:

--

  - if the word was on the original list **and** the participant responds that the word was on the original list, then `\(x = 1\)`.

--

  - if the word was on the original list **and** the participant responds that the word was **not** on the original list, then `\(x = 0\)`.

--

- If we record the value of `\(x\)` corresponding to every trial for a single participant, we would end up with 50 different values that indicate whether the response registered for that trial's word was correct or not.

---

## Statistics

- The examples we have talked about are all statistics; a statistic is just a function of our sample (data).

--

- In our memory experiment we don't have a record of the responses of each participant to each word; we have something "simpler".
--

- We have another random variable that adds up all the correct responses.

--

- We have lost some information by doing so. Can you guess what information has been lost?

--

- We traded the information about the order of the correct responses for a summary of the experiment: the total number of correct responses.

---

## Statistics

- Every time we use a statistic (a function of our experimental outcomes) we either:

--

  - keep the same information (for example, assigning a value of 1 to heads in the coin toss), or

--

  - lose information (for example, when we take the number of correct responses in the memory experiment).

--

- In the majority of the examples in this course this will not be a problem; however, it is important to keep this loss of information in mind.

---

class: inverse, center, middle

# Commonly Used Statistics
## The Mean

---

## Mean

- As you know from Psych 10 B, one of the properties of a r.v. that we are interested in is its expected value.

--

- This value can be calculated with the formula:

`$$\mathbb{E}(x) = \sum_x x \ p(x)$$`

--

- We are faced with a problem here: when we gather data from an experiment, we don't know the probability of each of the values of our random variable.

--

- In other words, we don't know `\(p(x)\)`.

--

- For example, what is the probability that a participant has 40 correct responses?

---

## Mean

- Fortunately, we can mathematically prove that the **average** of a random variable will be close to the expected value.

--

- This is true regardless of the values of `\(p(x)\)`.

--

- Of course, this is just an approximation and will therefore be prone to error.

--

- But it will be our best guess!

--

- Calculating the average is simple:

`$$\bar{x} = \sum_{i = 1}^n \frac{x_i}{n}$$`

--

- Here we use `\(x_i\)` to indicate each of our observations (remember that in the memory experiment we have 50 0's or 1's). The variable `\(n\)` represents the total number of observations.
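--

- As a quick sketch (using made-up 0/1 responses, not the real data), the formula is just the sum of the observations divided by `\(n\)`, which matches R's built-in `mean()`:

```r
# Ten made-up correct/incorrect (1/0) responses
x <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)
n <- length(x)

# The average from the formula: sum of the x_i divided by n
x_bar <- sum(x) / n

x_bar    # 0.7
mean(x)  # same value
```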
---

# Example: mean age of the participants

- Let's go back to the memory example and look at the mean age of our participants.

--

```r
mean_age <- memory %>%
  summarise("bar_age" = mean(age)) %>%
  pull(bar_age)
```

--

- We can look at the mean age of our participants by typing the name of the variable in the console:

```r
mean_age
```

```
[1] 36.68
```

---

# Note:

- For your homework you will need to have average values shown in text; an easy way to do this is to use the following code in the text:

--

The mean age of the participants in the experiment was ``` ` r name-of-variable` ```

--

- which will be printed in the pdf as:

The mean age of the participants in the experiment was 36.68.

---

# Mean of more than one group

- Using the function `group_by()` in the `tidyverse` package, we can group our observations according to some variable and then get summaries.

--

- In our memory example, if we want to calculate the mean number of correct responses by test condition (test-1 vs test-2) we can use the following code:

```r
mean_test <- memory %>%
  group_by(test_id) %>%
  summarise("mean" = mean(correct))
```

--

- We don't use the `pull()` function in this case because we need to know the value for each test.

---

# Mean of more than one group

- We can look at the result by typing the name of our variable into the console:

```r
mean_test
```

```
# A tibble: 2 x 2
  test_id  mean
  <chr>   <dbl>
1 test_1   44.9
2 test_2   37.1
```

--

- We can also look at the results one at a time:

.pull-left[
- test 1:

```r
mean_test$mean[1]
```

```
[1] 44.94
```
]

.pull-right[
- test 2:

```r
mean_test$mean[2]
```

```
[1] 37.12
```
]

---

class: inverse, center, middle

# Commonly Used Statistics
## Sample variance

---

# Variance

- Another important property of a random variable is called variance.

--

- **Definition:** variance is the expected (squared) distance of a random variable from its expected value.

--

- Notice that this is an expectation with respect to another expectation.
--

- It is formally defined as:

`$$\mathbb{V}ar(x) = \mathbb{E}[(x - \mathbb{E}(x))^2]$$`

--

- We can treat it like any other expectation and get something like this:

`$$\mathbb{E}[(x - \mathbb{E}(x))^2] = \sum_x (x - \mathbb{E}(x))^2\ p(x)$$`

- This looks very similar to our definition of an expected value, so maybe we can do something similar and find a good approximation.

---

# Sample variance

- From the formal definition we know that we are missing two parts:

`$$\mathbb{E}[(x - \mathbb{E}(x))^2] = \sum_x (x - \mathbb{E}(x))^2\ p(x)$$`

--

- The probability of the outcomes `\(p(x)\)` and the expected value of `\(x\)`.

--

- But we already have a good approximation to the expected value: the mean!

--

- To find an approximation to the variance, we can take the average of the squared distances of our observations from the mean.

--

- This is called the sample variance, and it is a relatively good approximation to the real variance:

`$$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n}$$`

---

# Note:

- In other classes you might see the sample variance defined with an `\(n-1\)` in the denominator:

`$$s^2 = \sum_{i=1}^{n} \frac{(x_i - \bar{x})^2}{n-1}$$`

--

- This is a better approximation, and indeed it is the one that R calculates by default (and the one you need to use on homework 1).

--

- However, for the rest of the class we will use the definition that uses `\(n\)` in the denominator.

---

## Sample variance

- We can get the sample variance using similar code as with the mean. For example, the variance of the age of our participants can be obtained with:

```r
var_age <- memory %>%
  summarise("var_age" = var(age)) %>%
  pull(var_age)
```

--

- Remember that the variance is in squared units! So it's not that easy to interpret.

--

- The variance in the participants' age was 212.278995.

---

# Sample variance

- We can also calculate the variance of the number of correct responses by test in our memory experiment.
--

- The code will be almost the same:

```r
var_test <- memory %>%
  group_by(test_id) %>%
  summarise("variance" = var(correct))
```

- We can also look at the results one at a time:

.pull-left[
- test 1:

```r
var_test$variance[1]
```

```
[1] 4.541818
```
]

.pull-right[
- test 2:

```r
var_test$variance[2]
```

```
[1] 29.09657
```
]
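---

## Note: checking the two variance formulas

- As a quick sketch with made-up numbers (not the experiment's data), we can verify that `var()` uses the `\(n-1\)` denominator, while the class definition divides by `\(n\)`:

```r
x <- c(4, 8, 6, 5, 3)
n <- length(x)

# Class definition: average of squared distances from the mean (divide by n)
s2_n <- sum((x - mean(x))^2) / n

# R's default var() divides by n - 1 instead
s2_n_minus_1 <- var(x)

s2_n          # 2.96
s2_n_minus_1  # 3.7
```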